36 research outputs found

A Collaborative Approach to Computational Reproducibility

Author: Capone Rebecca
Chirigati Fernando
Freire Juliana
Rampin Remi
Shasha Dennis
Publication date: 01/01/2016

Although a standard in natural science, reproducibility has been only episodically applied in experimental computer science. Scientific papers often present a large number of tables, plots and pictures that summarize the obtained results, but then loosely describe the steps taken to derive them. Not only can the methods and the implementation be complex, but also their configuration may require setting many parameters and/or depend on particular system configurations. While many researchers recognize the importance of reproducibility, the challenge of making it happen often outweighs the benefits. Fortunately, a plethora of reproducibility solutions have been recently designed and implemented by the community. In particular, packaging tools (e.g., ReproZip) and virtualization tools (e.g., Docker) are promising solutions towards facilitating reproducibility for both authors and reviewers. To address the incentive problem, we have implemented a new publication model for the Reproducibility Section of Information Systems Journal. In this section, authors submit a reproducibility paper that explains in detail the computational assets from a previously published manuscript in Information Systems

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

FigShare

The PBase Scientific Workflow Provenance Repository

Author: Chirigati Fernando
Cuevas-Vicenttín Víctor
Dey Saumen
Kianmajd Parisa
Koop David
Ludäscher Bertram
Missier Paolo
Wei Yaxing
Publication venue: Edinburgh University Library
Publication date: 01/10/2014

Scientific workflows and their supporting systems are becoming increasingly popular for compute-intensive and data-intensive scientific experiments. The advantages scientific workflows offer include rapid and easy workflow design, software and data reuse, scalable execution, sharing and collaboration, and other advantages that altogether facilitate “reproducible science”. In this context, provenance – information about the origin, context, derivation, ownership, or history of some artifact – plays a key role, since scientists are interested in examining and auditing the results of scientific experiments. However, in order to perform such analyses on scientific results as part of extended research collaborations, an adequate environment and tools are required. Concretely, the need arises for a repository that will facilitate the sharing of scientific workflows and their associated execution traces in an interoperable manner, also enabling querying and visualization. Furthermore, such functionality should be supported while taking performance and scalability into account. With this purpose in mind, we introduce PBase: a scientific workflow provenance repository implementing the ProvONE proposed standard, which extends the emerging W3C PROV standard for provenance data with workflow-specific concepts. PBase is built on the Neo4j graph database, thus offering capabilities such as declarative and efficient querying. Our experiences demonstrate the power gained by supporting various types of queries for provenance data. In addition, PBase is equipped with a user-friendly interface tailored for the visualization of scientific workflow provenance data, making the specification of queries and the interpretation of their results easier and more effective
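To make the idea of querying provenance concrete, here is a minimal sketch of a lineage query over an execution trace. It is not PBase's schema, API, or query language: the trace, the file and activity names, and the `lineage` helper are all hypothetical, and a plain Python traversal stands in for a graph-database query. Only the relation names `used` and `wasGeneratedBy` come from W3C PROV.

```python
# Hypothetical execution trace as (subject, relation, object) triples.
# Relation names follow W3C PROV; every node name is made up for illustration.
TRACE = [
    ("result.csv", "wasGeneratedBy", "run_model"),
    ("run_model", "used", "clean.csv"),
    ("clean.csv", "wasGeneratedBy", "clean_data"),
    ("clean_data", "used", "raw.csv"),
    ("plot.png", "wasGeneratedBy", "make_plot"),
    ("make_plot", "used", "result.csv"),
]


def lineage(artifact, trace):
    """Return every activity and entity that `artifact` depends on, directly or not."""
    # Both PROV relations point from a node to something it depends on,
    # so one adjacency map is enough for an upstream traversal.
    depends_on = {}
    for subject, _relation, obj in trace:
        depends_on.setdefault(subject, []).append(obj)

    seen = set()
    stack = [artifact]
    while stack:
        node = stack.pop()
        for parent in depends_on.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


upstream = lineage("plot.png", TRACE)
print(sorted(upstream))
```

A graph database can express this kind of transitive "what did this result depend on" question declaratively instead of with a hand-written traversal.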

Directory of Open Access Journals

International Journal of Digital Curation

HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset

Author: Juan J. Lastra-Díaz
Ana García-Serrano
Montserrat Batet
Miriam Fernández
Fernando Chirigati
Publication venue: Elsevier BV
Publication date: 01/01/2017

This work is a detailed companion reproducibility paper of the methods and experiments proposed by Lastra-Díaz and García-Serrano in (2015, 2016) [56–58], which introduces the following contributions: (1) a new and efficient representation model for taxonomies, called PosetHERep, which is an adaptation of the half-edge data structure commonly used to represent discrete manifolds and planar graphs; (2) a new Java software library called the Half-Edge Semantic Measures Library (HESML) based on PosetHERep, which implements most ontology-based semantic similarity measures and Information Content (IC) models reported in the literature; (3) a set of reproducible experiments on word similarity based on HESML and ReproZip with the aim of exactly reproducing the experimental surveys in the three aforementioned works; (4) a replication framework and dataset, called WNSimRep v1, whose aim is to assist the exact replication of most methods reported in the literature; and finally, (5) a set of scalability and performance benchmarks for semantic measures libraries. PosetHERep and HESML are motivated by several drawbacks in the current semantic measures libraries, especially the performance and scalability, as well as the evaluation of new methods and the replication of most previous methods. The reproducible experiments introduced herein are encouraged by the lack of a set of large, self-contained and easily reproducible experiments with the aim of replicating and confirming previously reported results. Likewise, the WNSimRep v1 dataset is motivated by the discovery of several contradictory results and difficulties in reproducing previously reported methods and experiments. 
PosetHERep proposes a memory-efficient representation for taxonomies which scales linearly with the size of the taxonomy and provides an efficient implementation of most taxonomy-based algorithms used by the semantic measures and IC models, whilst HESML provides an open framework to aid research into the area by providing a simpler and more efficient software architecture than the current software libraries. Finally, we show that HESML outperforms the state-of-the-art libraries, and that their performance and scalability could be significantly improved without caching by using PosetHERep

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Open Research Online (The Open University)

The Oberta in open access

YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts

Author: Bertram Ludäscher
Christopher Jones
Christopher Schwalm
David Koop
Fernando Chirigati
James A. Macklin
James Cheney
James Hanken
Juliana Freire
Keith W. Kintigh
Khalid Belhajjame
Mark Bieda
Mark Schildhauer
Paolo Missier
R. Kyle Bocinsky
Saumen Dey
Steve Aulenbach
Tianhong Song
Timothy A. Kohler
Timothy McPhillips
Tyler Kolisnik
Yang Cao
Yaxing Wei
Publication venue: Edinburgh University Library
Publication date: 01/01/2015

Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow systems due to the convenience and familiarity of scripting languages (such as Perl, Python, R, and MATLAB), and to the high productivity many scientists experience when using these languages. YesWorkflow is a set of software tools that aim to provide such users of scripting languages with many of the benefits of scientific workflow systems. YesWorkflow requires neither the use of a workflow engine nor the overhead of adapting code to run effectively in such a system. Instead, YesWorkflow enables scientists to annotate existing scripts with special comments that reveal the computational modules and dataflows otherwise implicit in these scripts. YesWorkflow tools extract and analyze these comments, represent the scripts in terms of entities based on the typical scientific workflow model, and provide graphical renderings of this workflow-like view of the scripts. Future versions of YesWorkflow also will allow the prospective provenance of the data products of these scripts to be queried in ways similar to those available to users of scientific workflow systems
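The annotation idea described above can be sketched in a few lines. This is not the YesWorkflow tool or its implementation: it is a simplified illustration that recognizes only the `@begin`, `@end`, `@in`, and `@out` comment keywords, and the sample script, block names, and helper functions are hypothetical.

```python
import re

# Hypothetical script annotated with YesWorkflow-style comments.
SCRIPT = '''
# @begin clean_data
# @in raw_file
# @out clean_table
def clean(): pass
# @end clean_data

# @begin plot_results
# @in clean_table
# @out figure
def plot(): pass
# @end plot_results
'''

# Matches a comment such as "# @in raw_file" and captures keyword and name.
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\w+)")


def extract_blocks(source):
    """Return {block_name: {"in": [...], "out": [...]}} from annotation comments."""
    blocks = {}
    current = None
    for line in source.splitlines():
        match = TAG.search(line)
        if not match:
            continue  # ordinary code lines are ignored entirely
        tag, name = match.groups()
        if tag == "begin":
            current = name
            blocks[current] = {"in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            blocks[current][tag].append(name)
    return blocks


def dataflow_edges(blocks):
    """Link a producer block to a consumer block when an output name matches an input name."""
    edges = []
    for producer, p_ports in blocks.items():
        for consumer, c_ports in blocks.items():
            for data in p_ports["out"]:
                if data in c_ports["in"]:
                    edges.append((producer, data, consumer))
    return edges


blocks = extract_blocks(SCRIPT)
edges = dataflow_edges(blocks)
print(edges)
```

Because only comments are read, the script itself never has to run or be restructured, which is the property the abstract emphasizes.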

arXiv.org e-Print Archive

Directory of Open Access Journals

Edinburgh Research Explorer

International Journal of Digital Curation

311 Dataset

Author: Chirigati Fernando
Publication venue: Harvard Dataverse

This is a version of the 311 dataset used in the following paper: Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets, F. Chirigati, H. Doraiswamy, T. Damoulas, and J. Freire. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2016. The dataset includes records from 311, a telephone number that provides non-emergency services to New York City, from 2003 to 2014. The original data is available at the NYC Open Data portal

Harvard Dataverse Network

Weather Dataset

Author: Fernando Chirigati (2574763)

This is a version of the weather dataset used in the following paper: Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets, F. Chirigati, H. Doraiswamy, T. Damoulas, and J. Freire. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2016. The dataset includes records of weather data for New York City, from 2010 to 2014. The original data is available at the National Climatic Data Center website: http://www7.ncdc.noaa.gov/CDO/dataproduct

FigShare

Preserving and Reproducing Research with ReproZip

Author: Fernando Chirigati (2574763)

Introduction to using ReproZip to aid in reproducing research

FigShare